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A Monte Carlo method is given to compute the binding affinity of a ligand to a protein. The method involves 
extending configuration space by a discrete variable indicating whether the ligand is bound to the protein and 
a special Monte Carlo move which allows transitions between the unbound and bound states. Provided that an 
accurate protein structure is given, that the protein-ligand binding site is known, and that an accurate chemical 
force field together with a continuum solvation model is used, this method provides a quantitative estimate of 
the free energy of binding. 
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Introduction 

Many drugs work by binding to a target protein in an or- 
ganism and affecting the action of this protein. The binding 
of the drug molecule, the ligand L, to the protein P under 
physiological conditions is usually reversible (characterized 
by weak chemical interactions rather than covalent bonds), 

L + P ^ LP, 

and, in equilibrium and in the dilute limit, the concentration of 
the ligand-protein complex [LP] is given by the dissociation 
constant 



[L][P] 
[LP] ' 



It is convenient to define the binding affinity as 

K d /N A 
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where Na is the Avogadro constant. A high value for pK d is 
crucial to obtaining a good drug molecule and, consequently, 
the ability to compute pK d accurately would greatly acceler- 
ate drug discovery by allowing many molecules to be screened 
in silico before any time-consuming syntheses and assays are 
done. The dissociation constant can also be expressed in terms 
of the binding free energy AF, 



K d = exp(/?AF)/Vb, 



(1) 



where Vq is the system volume, (3 — l/(kT), T is the tem- 
perature, and k is the Boltzmann constant. Similarly pK d is 
given by 
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The quantity AF is the free energy difference of the ligand 
and the protein forming a bound complex LP, the "bound sys- 
tem," compared to the ligand and the protein isolated from one 
another L + P, the "unbound system." 



In order to compute pK d , we require (1) a sufficiently accu- 
rate model of the protein and the ligand and their interaction 
and (2) a good way to compute the resulting value of pK d . 
In this paper, we assume the first requirement is fulfilled and 
instead focus on meeting the second. 

The standard methods of computing free energies {HQ] are 
not capable of computing AF directly because the unbound 
and bound systems are too dissimilar, which hinders transi- 
tions between these systems. Instead, typically, two close lig- 
ands L a and Lb, are compared separately unbound and bound 
to the protein, thereby obtaining the difference in the free en- 
ergies AF a - AF b . 

We present here a practical method for directly comput- 
ing AF and hence pK d . The method consists of: (1) for- 
mulating the problem in an extended phase space which al- 
lows the unbound and bound systems to be treated as a single 
system and K d to be expressed as the ratio of two canoni- 
cal averages; (2) introducing a new Monte Carlo move, the 
"wormhole move," to make transitions between the unbound 
and bound states in this extended system; and (3) a method to 
determine the "portals" needed for the wormhole move. 



Formulation 

Consider a system of volume Vq consisting of a ligand mol- 
ecule L and a protein molecule P dissolved in N$ molecules 
S. The overall state of the system is given by [r, Tg] where 
r represents the phase space configuration of L and P and Fs 
represents the configurations of all the solvent molecules S. 
The energy of the system is given by F(r, Ts) and, in equi- 
librium, the system obeys the Boltzmann distribution |3] 



/(r,r 5 ) 



ex P [-/?F(r,r s )] 
J ex P [-/3F(r,rs)]drdr s ' 



It is frequently useful to average over the configurations of the 
solvent molecules by integrating the Boltzmann distribution 
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over Ts to give a reduced Boltzmann distribution 



/(r) = / /(r,r s )dr s 

exp[-^(r)] 



Jexp[-0E(T)] dY' 



where 



E(Y) 



0' 



In / ex P [-/3s(r,r s )]dr s 



is the energy of the system with ligand and protein configu- 
rations specified by Y and with the equilibrium effects of the 
solvent implicitly included as a solvation free energy. In this 
paper, we will assume that E(Y) is given. 

Typical molecular interactions have a short range. In view 
of this, let us define Eq to represent all accessible Y space 
(i.e., L and P somewhere in the system volume Vq), and Ei 
to represent that portion of So where there is an appreciable 
interaction between L and P which therefore form the com- 
plex LP. In the phase-space volume Ei we write the energy 
as Ei (r) which is just an alternate notation for the full energy 
E(Y), while in the volume Eo — Ei we may write the energy 
as Eq(Y) which we define as the "unbound" energy, i.e., E(Y) 
excluding the interaction between L and P. The dissociation 
constant can then be written as 



K d 



k -^^dY 



V | So e-^( r ) dT/ Ei e-^( r ) dY 



V a J Ei e-^i( r ) dY 



ho-Vi e~ PEo(r) dT + k ± e-^ £ i( r ) dY' 

In the dilute limit Vq — > oo, this can be simplified by ignor- 
ing the second term in the denominator of the last factor and 
by extending the limits of the integrals over Eo — Ei to in- 
clude Ei. In extending the definition of Eq(Y) to Y G Si, we 
include the intramolecular energy and the solvation free en- 
ergy but continue to omit the intermolecular (ligand-protein) 
energy. This yields (jH^lS 



1 J^exp[-PE (Y)}dY 
V J Si exphMi (T)]dT' 



(2) 



The Helmholtz free energy of the unbound and bound systems 
isH 

F A = - ilnQf exp[-0E x {T)] dY 

for A = and 1, and Eq. ([0 is obtained from Eq. with 
AF = F\ — Fq. The definition of Kd, Eq. (0, is strictly in- 
dependent of Vq because of translational symmetry (ignoring 



boundary effects). It is also independent of the precise defini- 
tion of Ei, provided that Ei includes the protein-ligand bind- 
ing site and does not include a "macroscopic" volume beyond 
this. 

In this formulation, we have assumed that the system vol- 
ume Vo is fixed. However, in most physiological systems, the 
pressure is held constant and the binding affinity is then re- 
lated to the differences in the Gibbs free energy which intro- 
duces a correction term which is the product of the pressure 
and the change in the volume caused by the formation of the 
LP complex |6]. We expect this correction to be small for 
typical ligand-protein interactions in solution. 

We would like to cast Eq. as the ratio of canonical av- 
erages which can be computed using the canonical-ensemble 
Monte Carlo method |7]. To achieve this, we combine the 
unbound and bound systems by extending phase space to in- 
clude a discrete index A € {0,1} and consider a system in 
this extended space with energy E X (Y) for which the canoni- 
cal average is defined by 

J2 x Jdrexp[-(3E* x (Y)}X x (Y) 



(X) = 



We take E* X (Y) to be infinite for Y 
Now Eq. can be rewritten as 



£ A /drex P [-/?£*(r)] 

Ea and finite otherwise. 



K,= 



1 (5 M exp(-f3[E Q (Y) ~ E* Q (Y)})) 
V (S X1 exp(-P[E 1 (Y)-E* 1 (Y)]))'- 



(3) 



where <5 Am is the Kronecker delta. If E X (Y) w E\(Y), the 
terms being averaged are 0(1). Because the definition of Kd 
is independent of Vo, we can pick Vo ~ 1/ Kd so that approx- 
imately the same number of samples contribute to each of the 
canonical averages. We show later, Eq. that this choice 
minimizes the error in the estimate of Kd- 

We can evaluate E X (Y) with short energy cutoffs allowing 
it to be computed more quickly than E\ (L) and the terms con- 
tributing to the averages in Eq. Q can be accumulated every 
hundred steps, for example. Since there is typically a high 
correlation between successive steps in a Monte Carlo sim- 
ulation, this method allows the averages to be computed to 
a given degree of accuracy in less time than if we had used 
E* X (Y) = E X (Y). 

The extension of phase space has been used in other free en- 
ergy calculations, to combine, for example, systems at several 
different temperatures 1 8] or to treat the "reaction coordinate" 
controlling the transition between two chemical species as a 
dynamic variable (SJUlI- In our case, the use of the worm- 
hole Monte Carlo (described in the next section) allows us to 
include just the two systems of interest without the need to 
compute the properties of (possibly unphysical) intermediate 
systems. 

Wormhole Monte Carlo 

We can compute the canonical averages in Eq. using the 
Monte Carlo method |J to make steps from [L, A] to [T 7 , A'] 
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FIG. 1: (a) Schematic representation of -Bo(r) (shallow and wide) 
and -Bi(r) (deep and narrow). Conventional Monte Carlo moves 
between A = and 1 (shown as dashed lines) are nearly always 
rejected because they lead to large increases in energy, (b) Schematic 
representation of typical portals for the wormhole moves for the case 
illustrated in (a), The large ratio of the volume of the unbound (A = 
0) portal compared to the bound (A = 1) portals compensates for the 
higher energy of the unbound configurations. This results in accepted 
wormhole moves (dashed lines) between all the portals. 



with probability 

min[l,exp(-/3[^,(r')-^(r)])]. 

However, the estimate of the ratio in Eq. (|3} will be very poor, 
because transitions between A = and 1 will be extremely 
rare — typically, Eq is shallow and wide, while E* is deep and 
narrow; see Fig.^a). One possible way of remedying this is 
to treat A as a continuous variable l^ flOII . providing a suitably 
interpolated definition of E\(T), and allowing Monte Carlo 
steps with small changes in A. In practice, this approach only 
"works" if the two endpoints are sufficiently similar, thus lim- 
iting this method to the comparison of chemically close mol- 
ecules. 

Here we propose an alternative way of carrying out a Monte 
Carlo simulation of the E^ (T) system. We restrict the stan- 
dard moves to changes in T only, and allow changes in A 
via "wormhole moves" lHUl which connect otherwise discon- 
nected regions of configuration space. This obviates the need 
to treat (possibly unphysical) intermediate values of A, per- 
mitting us to compute the free energy differences needed to 
determine the absolute binding affinity. 

Assume we have some system defined on a phase space T 



whose equilibrium distribution is proportional to g(T). The 
canonical average of a quantity X(T) is defined by 



(X) 



J (TCg(T)X(T) 
JdTg(T) ■ 



[r,A], 



on 



In our application, we make the identification T = 
I dT = ExJdT, and 5 (T) = exp[~l3E x (T)]. 

Let us define a set of "portal functions," w, w', w", 
T, with properties 

< w(T) < l/v < oo, 



rfTw(T) = 1, 



where v is a representative phase-space volume of the portal 
function. A wormhole move consists of the following steps: 
select a pair of portals (w, w') with probability p WW '\ reject 
the move with probability 1 — vw(T), where T is the current 
state; otherwise, with probability vw(T), pick a configuration 
T' with probability w'(T'); and accept the move to T' with 
probability P ww > (T, T'). If the move is rejected, T is retained 
as the new state. We determine P ww i(T, T') by demanding 
that the rate of making transitions from T to T' via portals 
(w, w' ) is balanced by the reverse rate from T' to T via portals 
(w' , w), i.e., 

Pww' (^3 ^ ) Rw'w J ^) 

(the condition of detailed balance), where the rates are given 
by 



S(T) 



/dT 5 (T) 

vw(T)w'(T')P ww ,(T,T')- 



A possible solution for the acceptance probability is 

D /TT tt'\ • h Pw'w <?(Y ) V 

P WW >{T, T ) = mm 1 



' Pww> g(T) v J ' 

where we have made use of the identity amin(l,a _1 ) = 
min(l, a), which is valid for a > 0. 

In order to apply this move, let us use a specific rather sim- 
ple form for the portal functions, w(T), namely 



w(T) 



l/v, for T e w, 
0, otherwise, 



where we now denote a portal by w which defines an arbi- 
trary subset of T space of volume v. In principle, there is 
no restriction on the choice of the portals; however, practical 
considerations, discussed below, dictate how they are chosen. 
Furthermore, for simplicity, we will assume that the wormhole 
probabilities are all equal, p ww > — const. 

Let us now describe the wormhole move in [r, A] space. 
Starting with a configuration [T, A], first pick a random portal 
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w. If [r, A] ^ w, then reject the move. Otherwise, pick a 
random configuration [r', A'] uniformly in a randomly chosen 
portal w', and accept the move to [V, A'] with probability 



mini 1 



exp[-PE* x (T)] v 



(4) 



This differs from the standard Boltzmann acceptance proba- 
bilty by the ratio of volumes, v'/v, which can be large enough 
to compensate for the difference in the mean energies in the 
portals. Indeed, if the mean energy of configurations in w 
scales as /3 _1 In w+ const., the acceptance probability, Eq. @, 
is O(l), thereby allowing moves between shallow wide wells 
and deep narrow ones; see Fig. \Jib). In practice, each por- 
tal will occupy one of the energy wells of either Eq(T) or 
E± (r). This implies that the A = portals will permit unre- 
stricted translation and rotation of the ligand and the protein 
subject to the Vo constraint, and the A = 1 portals will allow 
unrestricted motion of one of the molecules. 

The wormhole move embodies two concepts which have 
been used separately in other works: stretching or shrink- 
ing r space when making the move compensates for possibly 
large energy differences between wells II 2l 1 1 311 : and jump- 
ing between disconnected regions of phase space enables the 
Markov chain to explore regions of phase space separated by 
large energy barriers 1141 1 1 3fl . In the following sections, we 
will show how these elements may be combined to permit the 
computation of protein binding affinities with realistic force 
fields. 



Finding the portals 

In order for the wormhole method to be practical, we need a 
reliable way of choosing the portals. We describe this process 
first in the general case. Let us write T = [rL,Tp], where 
Tm = [^m, 2m] represents the state of molecule M, Xm rep- 
resents its position and orientation, and Sm represents its con- 
formation. 

We assume that Em is expressed in such a way that any con- 
straints on the positions of the atoms in M (e.g., bond lengths 
and bond angles) is implicitly accounted for, so that the di- 
mensionality of Sm reflects the number of degrees of confor- 
mational freedom, tim, for this molecule. 

Because of the translational and rotational symmetry of the 
system, certain components of T are ignorable. We can there- 
fore write £^(r) = E X (E\), with 

50 = [S L ,S P ], 

51 = [Y,3 L ,H P ], 

where Y = — Xp denotes the position and orientation of 
the ligand relative to the protein. The dimensionality of 2a is 
n\ with no = ?il + np and n\ = no + 6. 

The strategy for determining the portals is to carry out con- 
ventional canonical Monte Carlo simulations with El(E\) 



separately for A = and 1. For each A, we obtain a canon- 
ical set of configurations {E} (suppressing the A subscripts 
for brevity), to which we fit n-dimensional ellipsoids, as fol- 
lows. First we try to fit a single ellipsoid to {E}. The center 
of the ellipsoid is given by the mean configuration (H) . For 
each configuration in {E}, we determine the deviation from 
the mean, 5E = E — (S), and compute a co variance matrix 
for the configurations which can be diagonalized as 

(SE5E) = PAP T , 

where P is the matrix of (column) eigenvectors and A is the 
diagonal matrix of eigenvalues. Because of the properties of 
the covariance, P is real and orthogonal, P _1 = P T , and the 
eigenvalues are real and non-negative. If there are no hidden 
constraints on the motion, we additionally can assume that 
the eigenvalues are strictly positive. We find it convenient to 
define 



B = PA 1 / 2 , 



B -l =A -l/2pl 



so that we can write 



(SE6E) = BB 1 



We take the semi-axes of the ellipsoid to be the columns 
of y/nB. The multiplier here, ^/n, is chosen to ensure 
that O(l) of the configurations in {EE} lie within the ellip- 
soid. This choice is motivated by considering a symmetric 
ri-dimensional Gaussian 



for which we have 



exp(— \r 2 
(2tt)™/ 2 



l f{v)d n v = 



Ellipsoids are a natural choice to use to fit the set of configu- 
rations for the following reasons: (1) The iso-density contours 
of the distribution in a harmonic well are ellipsoids. (2) It is 
easy to sample points randomly from an ellipsoid. (3) Con- 
versely, it is easy to test that a point lies inside an ellipsoid. 
(4) The volume of an ri-dimensional ellipsoid is given by 



(n/2)! 



n 



a,. 



where at is the length of ith semi-axis. Note well the degener- 
ate case of this result, vq = 1, which is used if the protein and 
ligand are both rigid (n = 0). The sampling and testing of 
points, (2) and (3), can either be accomplished by transform- 
ing with the matrix B. Alternatively, we can use the simpler 
Cholesky decomposition of the covariance matrix 

(SESE) = CC T , 

where C is a lower triangular matrix. Both C and C _1 may be 
computed by direct (non-iterative) methods 1 16]. 
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Having defined an ellipsoid in this way, we test its suitabil- 
ity as a portal by demanding that O(l) of the configurations 
sampled uniformly from it have energies close to its mean en- 
ergy (i?*(H)). If this test fails, we split {EE} into two sets 
according to the sign of SB, projected along the largest semi- 
axis of the ellipsoid and construct new trial portals from each 
of these sets. 

The ellipsoids that result from this process constitute our 
portals. When computing the volumes of the portals for use 
in Eq. l|4}, we need to account for the freedom to place 2 — A 
molecules at arbitrary positions and orientations in the volume 
Vq. In practice, this means we multiply the volume of the 
A = portals by aVo where Vq is the translational volume 
and a is the orientational volume (given below). 

In order to complete the specification of the portals, we 
need to describe how (EE) and 6E are formed, since, as a 
consequence of working in the subspace where the molecu- 
lar constraints are implicitly satisfied, EE is not simply a point 
in K™. We wish to represent each ellipsoid on a locally Carte- 
sian space R" which lets us use familiar formulas for defining 
the ellipsoid. We also demand that the mapping to R" have 
constant Jacobian in order to maintain detailed balance when 
sampling from the ellipsoids. 

We illustrate this by considering the case of the protein and 
the ligand being made up of «m + 1 rigid fragments simply 
connected by um bonds each of which allow rotation only 
(i.e., the bond lengths and bond angles are fixed). The relative 
position and orientation of L with respect to P, Y, can be de- 
fined in terms of one of the rigid fragments of each molecule. 
Finally, we use unit quaternions 11711 to represent orientation. 
Since the quaternions q and — q represent the same rotation, 
orientations are given by an axis 11811 of the sphere 5* 3 . 

The coordinates making up EE are then: (a) the position 
component of Y, a point in R 3 ; (b) the orientation component 
of Y, an axis of the sphere S 3 ; and (c) the dihedral angles 
of the rotatable bonds of the ligand and protein, points on the 
circle S 1 . The definition of (S) and 5E is straightforward for 
(a), since we use the normal arithmetic definitions. For (c), 
we define the mean for each angle [ 18] by the direction of the 
mean of the points on S 1 embedded in R 2 . We form the devi- 
ation in this case by subtraction modulo 2tt so that the result 
lies in [— it, it}. 

To find the mean of the orientations (b), we similarly em- 
bed S 3 in M 4 . We define the mean orientation |18] as the 
axis in M 4 about which the moment of inertia of the sample 
axes is minimum. To find the deviation of the orientation, 
we compute the differential rotation 5q = q(q)* which takes 
the mean orientation to the sample orientation and we project 
this into a "turn" vector u in the unit ball in R 3 so as to pre- 
serve the metric. This is achieved by a generalization of the 
Lambert azimuthal equal-area projection as described in the 
Appendix. In this representation, the volume of orientational 
space (which contributes to the multiplier for the unbound vol- 
umes) is a = |-/t. 

Occasionally, the point sampled from w' corresponds to one 
of the coordinates "wrapping around," i.e., the change in the 



dihedral lies outside [—%, ir] or the turn u lies outside the unit 
ball. In order to preserve detailed balance, we reject the re- 
sulting move. 

There are four ways in which we can improve the qual- 
ity of the portals obtained by this method so as to make suc- 
cessful wormhole moves more likely. (1) The Monte Carlo 
runs used to obtain the samples from which the portals are 
defined should begin with an "annealing" phase where the 
temperature is started at some high value and slowly reduced 
to T at which point we start gathering samples. This al- 
lows the Monte Carlo sampling to explore phase space more 
thoroughly. (2) Other methods of finding conformational en- 
ergy minima 1 19] can be applied to provide additional starting 
points for the Monte Carlo runs. (3) The samples can be sup- 
plemented with those obtained by applying those symmetry 
operations which leave the molecules invariant. In the case 
of non-chiral ligands, we would also apply a reflection of the 
ligand in the unbound case since this will leave the energy un- 
changed. In this way, the portal moves allow all the symmetric 
variants of the molecules to be explored so that symmetry is 
included in a systematic way in the computation of the bind- 
ing affinity. (4) When forming (<5EEo <5EEo), we should set the 
intermolecular cross terms to zero because the conformations 
of the two molecules are independent when A = 0. 

This method of finding portals depends on the samples 
"spanning" a volume of phase space. This requires that the 
dimensionality of phase space be sufficiently small and this, 
in turn, implies the use of an implicit solvation model. In ad- 
dition the portals can be more reliably found if the "hard" de- 
grees of freedom in the molecules are replaced by constraints 
(e.g., by fixing the bond lengths and bond angles as described 
above). 

The method of successively subdividing the samples may 
lead to suboptimal portals in some cases, for example, by di- 
viding a contiguous set of samples. We have recently exper- 
imented with fitting a mixture of Gaussians to the samples 
using the expectation-maximization (EM) algorithm 1 2(1 \2 ill . 
Since this optimizes the fit to all the samples, it usually results 
in fewer, better, portals. In this case, we are naturally lead to 
use Gaussians for the portal functions rather than the more re- 
strictive ellipsoids. By enforcing the symmetries of the ligand 
when making the fits, ligand symmetry can also be included 
in a rigorous way. These improvements will be described in a 
subsequent publication. 

The free energy calculation 

Prior to the calculation of the free energy, we estimate a 
suitable value of Vq, which enters into the definition of the 
volumes of the A — portals, by assigning an estimated sta- 
tistical weight of v exp(— j3(E*)) to each portal and estimat- 
ing 

°~ £A=o(»/^)exp(-/W)' 
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FIG. 2: The structure of p-amino-benzamidine. 

where the sums in numerator (denominator) are over the 
bound (unbound) portals and v/Vq for the unbound portals 
is the volume of the ellipsoid (multiplied by a) and thus does 
not depend on Vb- If this estimate for Vq results in the sam- 
pling being too heavily weighted toward A = or 1, Vq may 
be changed and the current values of the sample sums for Kd 
can be adjusted to account for this change. 

The choice of the starting configuration for the free energy 
calculation may introduce some bias in the results. We can 
remove much of this bias by using the portals to select the 
starting configuration: select a portal w with probability pro- 
portional to its estimated statistical weight; select a configura- 
tion, [r, A], uniformly from this portal; and accept the config- 
uration with probability 

min[l,exp(-/3[^(r) -(£*>])], 

where (E*) is the canonical average energy over the portal. 
This procedure is repeated until a configuration is accepted. 
The bias can be further reduced by running the Monte Carlo 
calculation for several correlation times, defined by Eq. (jfji, 
prior to gathering the data for Eq. Q. 

During the course of the free energy calculation, normal 
and wormhole Monte Carlo moves are mixed. With the nor- 
mal moves, we only attempt to change T and, for this reason, 
it is possible to have the Monte Carlo step size depend on 
A (with, usually, the step size being larger with A = 0). In 
practice, most (~ 90%) of the attempted moves are worm- 
hole moves because frequently we have [T, A] ^ w and the 
attempted wormhole move is inexpensively rejected. 

The method is robust in the sense that it does not depend 
on the particular form of the energy function. In addition, a 
failure to make transitions between A — and 1 can be de- 
tected. This may be because the test [T, A] S W never suc- 
ceeds (i.e., the configuration has been trapped in a new well), 
or because the acceptance criterion is never met, which indi- 
cates that there is a deep well within the well of one of the 
portals. In both cases, the problem can be corrected by adding 
a new portal based on recent configurations. 

Example 

The efficacy of the wormhole method depends on how fre- 
quently a configuration lies within one of the portals and how 
often the jump to the new portal is accepted. In order to assess 
these questions, we have computed the binding affinity of p- 
amino-benzamidine, whose structure is shown in Fig. [2] to the 
digestive enzyme trypsin, at T — 290 K. We emphasize that 



the primary goal of this exercise is to assess how well worm- 
hole moves allow the free energy calculation to converge. For 
this purpose, we are not so interested in comparing the result- 
ing computed binding affinity to the experimental data since 
this will depend in large measure on the accuracy of the force 
field and of the protein structure. Nevertheless, since the con- 
vergence will depend on the complexity of the energy "land- 
scape," we use this example to epitomize the binding of a 
small ligand to a protein. 

The coordinates of the atoms in trypsin are taken from a 
trypsin-benzamidine complex, 1BTY 12211 . At physiologi- 
cal pH, the amidine group is protonated (net charge of +1) 
and, in the complex, it is attracted to a negatively charged 
aspartate residue in trypsin inhibiting its enzymatic action. 
We employ the Amber 7 force field Q and the 
GB/SA solvation model ^ E3 El. The protein is taken 
to be rigid, np = 0, and two bonds of the ligand are al- 
lowed to rotate, namely, those connecting the amidine and 
the amino groups to the benzene ring, — 2. The pub- 
lished force field 12 311 does not provide a satisfactory tor- 
sion for the bond between the benzene ring and the amidine 
group and this term was determined using Gaussian 98 1 29] as 
[-14.2 cos(20) + 3.3 cos(4</>) + 0.5 cos(6</>)] kJ/mol, where 
<f> is the dihedral angle. 

Five unbound and five bound canonical simulations of 1000 
steps each were carried out to find the portals. The resulting 
configurations were supplemented by those obtained by ap- 
plying the symmetry operations which leave the ligand invari- 
ant. This gave 16 unbound and 8 bound portals. The config- 
urations of the bound portals are all the same "pose" of the 
ligand on the protein and correspond to the symmetries of p- 
amino-benzamidine given by rotating the amino, benzene, and 
amidine groups by 180°. The unbound ligand also exhibits 8 
symmetries given by rotating the amino and amidine groups 
by 180° relative to the central benzene ring and by including 
the mirror images. However since the bond parameters allow 
partially free rotation of the amidine group, each symmetric 
set of configurations is represented by two portals. 

We estimated a suitable value for Vq of 0.39 x 10~ 18 m 3 
based on the volumes and the mean energies of the com- 
puted portals. During the binding affinity calculation, worm- 
hole moves were attempted on 90% of the steps (the other 
10% were standard Monte Carlo moves). Of these attempted 
wormhole moves, 97% failed (inexpensively) because the 
configuration was not in the chosen w. Of the remaining 3%, 
about 60% lead to a successful move, of which about 40% in- 
volved switching from A = to 1 and vice versa. The major 
computational cost in the free energy calculation is the eval- 
uation of the bound energies. In this example, the wormhole 
moves required less than 3 bound energy evaluations, on av- 
erage, to effect the transition from a bound configuration to 
an unbound one and back. This enables an accurate estimate 
to be made of the ratio of the averages in Eq. which after 
5 x 10 6 steps yields pKd — 7.99 ± 0.01. The error estimate is 
derived in the next section and represents a 2% relative error 
in K d . 
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FIG. 3: Cumulative estimates pKd(s) obtained by sampling every 
100th step from the first s steps of 5 independent Monte Carlo runs. 
The dashed lines shows convergence as 1/y/s to the mean value of 
7.99. 



This example illustrates that the method is effective at al- 
lowing sufficient transitions between the bound and unbound 
states to enable the binding affinity of protein-ligand systems 
to be accurately computed. We note that our computed bind- 
ing affinity differs from the experimental results of 5.1 to 5.2 
I3(ll3lll . This discrepancy may be accounted for by modest, 
~ 20%, errors in the force field. 



Error analysis 

In order to assess the errors in the computation of pKd in 
more detail, we carried out 10 independent runs similar to the 
one described in previous section. Each of these used the same 
portals and the same value of Vq. We computed cumulative 
estimates for pKd based on the first s steps of the Markov 
chains. When forming the averages in Eq. Q we sample every 
100th step. (As we shall see, there is a high degree of correla- 
tion within 100 steps; thus there would be little improvement 
in the estimate of pKd by sampling more frequently.) The re- 
sults for 5 of these runs are shown in Fig. [3] The convergence 
toward the mean is as l/\fs; after 5 x 10 6 steps, the error in 
pKd has been reduced to about ±0.01. 

For the purposes of further analysis, let us assume that the 
computation is carried out with E^(T) = E\(T) so that all 
samples have the same statistical weight. The probability that 
the system is in the bound (resp. unbound) state is p = (5\i) 
(resp. q = 1 — p = (S\q)) and Eq. becomes 



V p 



(5) 



The determination of Kd is then equivalent to estimating q/p 
by taking the ratio of the number of unbound and bound steps 
in the Monte Carlo simulation. How well this estimate con- 



FIG. 4: The A-correlation function Ct- The curves show Ct for 
Vo/(0.39 x 10~ 18 m 3 ) = (a) i, (b) §, (c) 1, (d) 3, and (e) 10. The 
ratio q/p varies correspondingly (from approximately | to ^jp). The 
correlation times are (a) 329, (b) 346, (c) 352, (d) 382, and (e) 415. 
In determining the correlation times, we limit the sum in Eq. |6| to 
< t < 2500 in order to avoid including the small but noisy terms 
for large t. 



verges depends, naturally, on the "correlation time" of the sys- 
tem. If wormhole moves rarely cause the system to switch be- 
tween bound and unbound states, the correlation time is large 
and the convergence will be slow. 

In order to make these ideas quantitative, we define the A- 
correlation function, 

C t = ((A s -p)(X 8+t -p)) 8 , 

where A s is the value of A at simulation step s and (. . .) s de- 
notes an average over steps. Figure [4] shows Ct for several 
different values of V®. (From Eq. Q, we have q/p oc V$.) 
In a Markov chain of length s, the expected number of bound 
states is ps, while, for s — > oo, the variance in the number of 
bound states is 2Ds, where 



D 



;C 



t>o 



Ct = -C t. 



(6) 



This provides us with the definition of the correlation time, r. 
From Fig. @] we see that Ct decays approximately exponen- 
tially so that the sum converges. 

We can compare the Monte Carlo simulation to a simpler 
Markov process of independent trials, e.g., tossing a coin 
where the probability of heads is p. In this case, we have 
Ct>o = and D = |Co = \pq. In the limit s — ► oo, 
the relative errors in estimating p for the Monte Carlo simu- 
lation will be the same as that for s/t tosses of the coin. We 
can illustrate this by making 5 independent simulations of a 
coin tossing experiment to match the data in Fig. [31 for which 
p = 0.443 and r = 352. The results are given in Fig.|5j where 
we plot h n /t n against n where h n (resp. t n ) is the number of 
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FIG. 5: Estimation of the bias of a coin with p = 0.443. Five 
independent runs are shown. In each run, the coin was tossed 5 x 
10 6 /r = 14200 times, where r is the correlation time for Fig.|4jc). 
and the cumulative value of h n /t n is recorded. The expected value, 
p/q, is shown as a dashed line. The axes have been adjusted to allow 
the figure to be directly compared with Fig. [3] 

heads (resp. tails) after n throws. As expected, Figs. [3] and|5] 
both exhibit the same convergence behavior. 

Since the distribution of outcomes h n in the coin tossing 
experiment is the binomial distribution, the mean and variance 
for l n = ln(h„ /t n ) can be calculated, yielding 

dn)=U P -) + U P -- q -) + 0(n-% 
\qj 2n\q pj 

((l n -(L)) 2 ) = — + 0(n- 2 ). 
npq 

The standard error in the estimate of pKj from a Monte Carlo 
run with s steps is found by substituting n = s/t and scal- 
ing by ln 10 (since pKg is defined in terms of the common 
logarithm) to give 




In the case of the simulations shown in Fig. |5J we find 
ApKd ~ 0.01 consistent with the data in the figure. From 
Fig. |4j we see that r is weakly dependent on p. Thus for a 
given s, we minimize the error by choosing p = q = i. Al- 
ternatively, we may wish to minimize the error for a given 
amount of computational effort. If bound steps are / times 
more expensive to carry out than unbound steps, the error is 
minimized by taking q/p = \f] , i.e., p = 1/(1 + y/J). 

Discussion 



and bound systems to be treated as a single canonical system 
and Kd to be expressed as the ratio of canonical averages, 
Eq. Q; (2) the wormhole move which allows transitions be- 
tween the bound and unbound systems; and (3) the use of an 
implicit solvation model which reduces the number of degrees 
of freedom and so allows portals to be identified. 

A method similar to ours is "simulated mutational equili- 
bration," which employs an extended phase space, uses 
an implicit solvation model, and allows jumps between differ- 
ent systems. However, in place of the wormhole move, this 
work employed a more restrictive "jumping between wells" 
move, which limited its applicability to computing the differ- 
ence in the binding affinities for two enantiomers. In contrast, 
our wormhole method can be used to compute directly the 
binding affinity of a ligand with several rotatable bonds to a 
protein target. This allows us to study a wide range of inter- 
esting drug-like ligands. 

In the special case of binding rigid molecules (for which 
we have Eq (T) — 0), wormhole Monte Carlo is isomorphic 
to a grand canonical Monte Carlo simulation 1 33] . Consider a 
grand canonical system with a single protein molecule and 
with ligand molecules at a fixed chemical potential p; we 
make the additional restriction that the ligand-ligand interac- 
tion energy is infinite, so that the system can only accommo- 
date or 1 ligand molecule. For the wormhole simulation, 
we specify the bound portal to include the full simulation vol- 
ume with arbitrary orientation; the unbound port is degenerate 
and corresponds merely to specifying A = 0. The wormhole 
moves from A = to 1 and vice versa correspond to particle 
insertion and deletion in the grand canonical simulation; and 
we can verify the acceptance probabilities are the same. 

The application described here can be extended by allow- 
ing a greater degree of flexibility for the protein and the lig- 
and. This permits the treatment of side-chain rotation and 
ligand-induced loop movement on the part of the protein and 
the treatment of flexible rings for the ligand. Many stan- 
dard Monte Carlo techniques can be used with this method, 
if appropriate — preferential sampling, early rejection, force 
bias, etc. Wormhole moves could also be used to treat other 
discrete, or nearly discrete, transitions. Examples are: the 
treatment of molecules, such as cyclohexane, which can as- 
sume distinct conformations; discrete sets of side chain rota- 
tions in the protein; protonation and tautomerization states for 
either molecule. In each of these cases, the acceptance prob- 
ability for the transitions should account for the free energy 
difference between the discrete states. 
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Appendix: Generalized Lambert projection 

The Lambert azimuthal equal-area projection projects a 
point on S 2 to a point on a disk in R 2 . Here, we will gen- 
eralize this to arbitrary dimensions, i.e., we will find a projec- 
tion from S" to a ball in R™. (Thus the circle S 1 is projected 
into a line segment; the surface of a sphere S 2 is project into 
a disk; S 3 is projected into a sphere; etc.) The projection is 
azimuthal, so that directions from the pole are preserved in the 
projected space. 

The sequence S n can be defined recursively by 

S° = [±1], 

S n = [cos 9, sinflS'"- 1 ], 

where is the colatitude and < 9 < it. The area of S n lying 
between 9 and 9 + d9 is therefore 



dA 



_isin n_1 9 dO, 



where a n is the area of S n . We project that portion of S n 
with colatitude in [0, 9] to a ball in R" of radius t. Equating 
the measures (area on S n and volume in R"), we obtain 



Vn,t dn 



sin 



where v n is the volume of a unit ball in 

d(v n f 



dt 



x 9'd9' 

R™. Using the relation, 

= nv n . 



we obtain 



t = 



n I sin 

o 



9'd9' 



l/n 



(8) 



The general "equal-area" projection is given by the mapping 



[cos 9, sin 9 S 

where t £ R", t is given by the point on and t is given 

by Eq. (|8}. Some special cases of Eq. (|8} are 



t 



2 sin if 
[1(2*- 



,1/3 



for n = 1, 
for n = 2, 

for n = 3, 

for rt = 4. 



sin 20)]" 
2sini0[i(l + 2cos 2 i0)] 1/4 , 

The case n = 1 corresponds to unwrapping a circle onto a 
line; and n — 2 gives the Lambert azimuthal equal-area pro- 
jection. We are interested in the case n — 3 as a way of 
mapping orientations from unit quaternion space to R 3 . A 
quaternion q = [cos 0, sin u] represents a rotation of ifj — 20 
about u. Noting that q and — q represent the same rotation, we 
may make the restriction < < \~n . We choose, there- 
fore, to map this hemisphere of S 3, onto u in the unit ball, by 

1/3 

making the substitutions = ^ip and t = (§7r) u to give 



tp — sin tp 



1/3 



(The other hemisphere, corresponding to ^tt < 9 < it, maps 
onto the shell 1 < u < 2 1 / 3 .) We call this u representation of 
orientations "turn space." 
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